Abstract
Here you can find how the processing and plotting related to our B cell
receptor clonotype analysis were performed for the paper entitled
“Multivalent Antigen Display on Nanoparticles Diversifies B Cell
Responses”. This R Markdown document includes the B cell receptor
analysis, the plots, and the generation of intermediate files for
plotting with other programs. The processed files used here are
AIRR-compliant datasets that are automatically downloaded from Zenodo
(DOI: 00000000); they are either Human Respiratory Syncytial
Virus-specific (RSV-specific) or the full bulk repertoire from the
vaccinated non-human primates (Macaca mulatta).
The loaded libraries below need to be present; the snakemake
command will try to install them in the conda environment.
library(jsonlite)
library(tidyr)
library(treeio)
library(ggtree)
library(rstatix)
library(ggpubr)
library(Biostrings)
library(data.table)
library(vegan)
library(ggplot2)
library(scales)
library(ggprism)
library(dplyr)
library(kableExtra)
source("df_to_fasta.R")
set.seed(123)
Some of the data used here is stored in .rds format, but it can
also be found as .tsv files at Zenodo (DOI: 000000000).
data_comp <- read.csv("../data/ELISA_comp/2023-03-06_normalized_plasma_compt.csv") %>%
mutate(group = case_when(group == "NP 100%" ~ "20-mer",
group == "NP 50%" ~ "10-mer",
group == "Sol" ~ "1-mer",
group == "PostF" ~ "PostF"))
clonotype_rsv <- readRDS("../data/clonotypes/individualized/rsv-specific_clonotypes.rds")
# rename columns to AIRR-compliant names and filter out the "PV" timepoint, which was not used in the analysis
clonotype_rsv <- clonotype_rsv %>%
filter(timepoint != "PV") %>%
mutate(cdr3_aa_length = nchar(CDR3_aa),
cdr3_aa = CDR3_aa,
v_call = V_gene,
j_call = J_gene,
d_call = D_gene)
# set color for fill and color aes layers
fill_col_values <- c("20-mer" = "#5F90B0", "10-mer" = "#92CDD5", "1-mer" = "#D896C1", "PostF" = "grey50")
color_values <- c("20-mer" = "#2E6997", "10-mer" = "#469698", "1-mer" = "#BE3C8C", "PostF" = "grey20")
# load repertoire sequencing reads info
rep_seq_ls <- list.files("../data/processed_data/rep_seq", pattern = "stats.json", recursive = TRUE, full.names = TRUE)
rep_seq_names <- list.files("../data/processed_data/rep_seq", pattern = "stats.json", recursive = TRUE, full.names = FALSE)
names(rep_seq_ls) <- sub("_st","",gsub("/","_",substr(rep_seq_names, 1,10)))
The non-human primate plasma samples from this study were used for antibody competition ELISAs, coated with pre-fusion or post-fusion protein, against previously characterized monoclonal antibodies. Multidimensional scaling (MDS) aims to preserve the distances between data points and/or samples across a set of variables: the closer two points are, the more similar their competition profiles are in this case. It is a good way to summarize a large number of variables in a 2D space. The MDS input is a dissimilarity matrix; for this plot, the matrix is based on cosine distance. Although Euclidean distance is the most commonly used, we chose cosine distance here because the magnitude of the responses matters less to us than the competition profile itself. Some NHPs were very good responders and therefore had higher titers across all competitions, so Euclidean distance would place them far from all other NHPs simply because of their high antibody titers. Since cosine distance measures the angle between profiles rather than the values themselves, it is insensitive to the overall magnitude of the antibody titers.
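A minimal illustration of this point, using toy numbers rather than study data: two samples with the same competition profile but a ten-fold difference in titers are identical under cosine distance, yet far apart under Euclidean distance.

```r
# Toy demo (not study data): same profile, 10x different magnitude.
a <- c(1, 2, 4)
b <- 10 * a
cosine_dist <- function(x, y) 1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
euclid_dist <- function(x, y) sqrt(sum((x - y)^2))
cosine_dist(a, b)  # ~0: the angle between the profiles is zero
euclid_dist(a, b)  # large: dominated by the magnitude difference
```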
cosine_distance <- function(x) {
#For cosine similarity matrix
Matrix <- x %>%
select(-c(timepoints, ID, group)) %>%
as.matrix()
sim <- Matrix / sqrt(rowSums(Matrix * Matrix))
sim <- sim %*% t(sim)
#Convert to cosine dissimilarity matrix (distance matrix).
D_sim <- as.dist(1 - sim)
mds <- D_sim %>%
cmdscale(3) %>%
as_tibble()
colnames(mds) <- c("Dim.1", "Dim.2", "Dim.3")
mds <- mds %>%
mutate(group = x$group,
timepoints = x$timepoints,
ID = x$ID)
return(mds)
}
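The `cmdscale()` call inside the function performs classical MDS; a self-contained sketch with three points whose pairwise distances form a 3-4-5 triangle shows that the recovered 2D coordinates reproduce the input distances exactly.

```r
# Classical MDS sketch (illustrative): recover coordinates from a dist object.
pts <- matrix(c(0, 0, 3, 0, 0, 4), ncol = 2, byrow = TRUE)
d <- dist(pts)                  # pairwise distances: 3, 4, 5
coords <- cmdscale(d, k = 2)    # 3 x 2 coordinate matrix
max(abs(dist(coords) - d))      # ~0: distances preserved
```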
# calculate cosine distance and generate MDS with and without PostF
mds <- cosine_distance(data_comp)
mds_no_postf <- data_comp %>% filter(group != "PostF") %>% cosine_distance()
ls_mds <- list(mds, mds_no_postf)
for(f in seq_along(ls_mds)){
# Plot and color by groups
mds_plot <- ls_mds[[f]]
plot <- mds_plot %>%
ggplot(aes(Dim.1, Dim.2, color = group, fill = group)) +
geom_point(size = 4, shape = 21) +
ggtitle(f) +
scale_fill_manual(values = fill_col_values) +
scale_color_manual(values = color_values) +
lims(x = c(-.6,.5), y = c(-.4, .4)) +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(size = .5),
legend.position = c(.1,.9)) +
facet_wrap(~timepoints)
print(plot)
ggsave(plot = plot, filename = paste0("../", result.dir, f,"_mds_cosine-distance.pdf"), device = "pdf", width = 6, height = 4)
}
Save a table containing read information from the high-throughput sequencing. It includes the number of raw reads per animal and the number of processed reads with high-quality assignment of HV and HJ genes, obtained using the IgDiscover software as an IgBlast wrapper.
rep_seq_stat <- lapply(rep_seq_ls, function(x) fromJSON(txt = x ))
rep_seq_stat <- lapply(rep_seq_stat, as.data.frame)
rep_seq_stat <- data.table::rbindlist(rep_seq_stat, idcol = "ID", fill = TRUE)
rep_seq_stat <- rep_seq_stat %>%
select(c(ID,read_preprocessing.raw_reads, assignment_filtering.total, assignment_filtering.has_vj_assignment, 17:21))
write.table(rep_seq_stat, "../data/processed_data/summary_sequencing_table.tsv", row.names = FALSE, sep = "\t", quote = FALSE )
Saving of the intermediate files is commented out, so not all of
them are written by default. Uncomment those lines if all the
intermediate files are needed. The intermediate files were used to run
Recon (v2.5) (Kaplinsky & Arnaout, Nat Commun, 2016) with default
parameters; running Recon is described further in the next section,
Recon estimates for RSV-specific diversity.
# add column for loop of ID, timepoint and specificity
clonotype_rsv <- clonotype_rsv %>%
mutate(ID_timepoint_spec = paste0(ID_timepoint, "_", specificity_group))
# create empty lists for loop
filtered_animal_rsv <- list()
clonotype_summary_rsv <- list()
# loop to create summary and full clonotype files for saving and/or following analysis
for (animals in unique(clonotype_rsv$ID_timepoint_spec)) {
# save full clonotype files
filtered_animal_rsv[[animals]] <- clonotype_rsv %>%
filter(ID_timepoint_spec == animals) %>%
as.data.frame()
# write.csv(filtered_animal_rsv[[animals]], paste0("../", result.dir, animals, "_clonotype_full.csv"), row.names = FALSE)
# save summary files
clonotype_summary_rsv[[animals]] <- filtered_animal_rsv[[animals]] %>%
group_by(grp, timepoint, ID, specificity_group, ID_timepoint_spec) %>%
summarise(clonal_size = n(), first(v_call), first(j_call), first(cdr3_aa)) %>%
arrange(desc(clonal_size)) %>%
ungroup()
#write.csv(clonotype_summary_rsv[[animals]], paste0("../", result.dir, animals, "_rsv_clonotype_summary.csv"), row.names = FALSE)
}
# do the same for the total repertoire, i.e. without splitting by specificity (PreF, DP, PostF)
for (animals in unique(clonotype_rsv$ID_timepoint)) {
# save summary files
clonotype_summary_rsv[[paste0(animals, "_total")]] <- clonotype_rsv %>%
filter(ID_timepoint == animals) %>%
as.data.frame() %>%
mutate(
specificity_group = "Total",
ID_timepoint_spec = paste0(ID_timepoint, "_", specificity_group)
) %>%
group_by(grp, timepoint, ID, specificity_group, ID_timepoint_spec) %>%
summarise(clonal_size = n(), first(v_call), first(j_call), first(cdr3_aa)) %>%
arrange(desc(clonal_size)) %>%
ungroup()
# write.csv(clonotype_summary_rsv[[paste0(animals, "_total")]], paste0("../", result.dir, animals, "_rsv_clonotype_summary.csv"), row.names = FALSE)
}
# Save clonotype summary per animal, taking into account ID, timepoint and clonal group
filtered_animal_rsv_summary <- list()
for (animals in unique(clonotype_rsv$ID)) {
# store summary files in the list
filtered_animal_rsv_summary[[animals]] <- clonotype_rsv %>%
filter(ID == animals) %>%
as.data.frame() %>%
group_by(grp, timepoint, ID) %>%
summarise(clonal_size = n(), single_cells = first(sc_clone_grp), v_call = first(v_call), j_call = first(j_call), cdr3_aa = first(cdr3_aa)) %>%
arrange(desc(clonal_size)) %>%
ungroup()
# write.csv(filtered_animal_rsv_summary[[animals]], paste0("../", result.dir, animals, "_rsv_clonotype_summary.csv"), row.names = FALSE)
}
to_recon_rsv <- data.table::rbindlist(clonotype_summary_rsv) %>%
select(clonal_size, ID_timepoint_spec) %>%
group_by(clonal_size, ID_timepoint_spec) %>%
summarise(size = n()) %>%
ungroup()
for (i in unique(to_recon_rsv$ID_timepoint_spec)) {
to_recon_table <- to_recon_rsv %>%
filter(ID_timepoint_spec == i) %>%
select(clonal_size, size)
# fwrite(to_recon_table, file = paste0("../", result.dir, i, "_rsv_file_to_recon.txt"), append = FALSE, sep = "\t", row.names = FALSE, col.names = FALSE)
}
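The `to_recon_table` written above is a clone-size distribution, i.e. how many clones were observed with each clonal size. With hypothetical clone sizes, the format looks like this:

```r
# Illustrative only: Recon's input is a two-column clone-size distribution.
clone_sizes <- c(5, 3, 1, 1, 1)                 # hypothetical clonal sizes
recon_input <- as.data.frame(table(clone_sizes)) # size 1 -> 3 clones, 3 -> 1, 5 -> 1
recon_input
```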
Species richness (Chao1) was calculated using the vegan
R package. The samples were first subsampled 100 times for each
animal/timepoint, and then the species richness was estimated. The mean
of those 100 replicates was used for plotting.
Here is the processing and estimation of species richness:
rsv_chao <- lapply(clonotype_summary_rsv, function(x) select(x, "clonal_size"))
# subsample and replicate the subsampling 100 times for higher accuracy
chaox100 <- function(x, value_to_subsample) {
replicate(100, {
subsample <- vegan::rrarefy(x, value_to_subsample)
chao <- vegan::estimateR(subsample)
return(chao)
})
}
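For intuition, the classic Chao1 estimator can be written directly; note that `vegan::estimateR()` as used above returns a bias-corrected variant, so the values differ slightly.

```r
# Classic Chao1 (illustrative): S_chao1 = S_obs + f1^2 / (2 * f2), where
# f1 and f2 are the numbers of clones observed exactly once and twice.
chao1_classic <- function(abund) {
  f1 <- sum(abund == 1)
  f2 <- sum(abund == 2)
  sum(abund > 0) + if (f2 > 0) f1^2 / (2 * f2) else f1 * (f1 - 1) / 2
}
chao1_classic(c(5, 3, 2, 2, 1, 1, 1))  # 7 observed clones + 9/4 = 9.25
```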
diff_spec_timepoints <- unique(substring(names(clonotype_summary_rsv), 5))
# subsample based on lowest total clonotype size per group
chao_list_results <- list()
for (spec_timepoint in diff_spec_timepoints) {
print(spec_timepoint)
rsv_filtered <- rsv_chao[grepl(spec_timepoint, names(rsv_chao))]
min_to_subsample <- min(unlist(lapply(rsv_filtered, colSums)))
chao_list_results[[spec_timepoint]] <- lapply(rsv_filtered, chaox100, min_to_subsample)
}
## [1] "Single-cell_DP"
## [1] "B1_DP"
## [1] "B2_DP"
## [1] "Single-cell_PreF"
## [1] "B1_PreF"
## [1] "B2_PreF"
## [1] "Single-cell_PostF"
## [1] "B1_PostF"
## [1] "B2_PostF"
## [1] "Single-cell_total"
## [1] "B1_total"
## [1] "B2_total"
change_names <- function(x) {
names(x) <- gsub("_.*", "", names(x))
x
}
# adjusted dataset for plotting
{
chao_results_df <- purrr::map(chao_list_results, ~ change_names(.x))
chao_results_df <- rbindlist(chao_results_df, use.names = TRUE, idcol = TRUE, fill = TRUE)
chao_results_df$algorithm <- rep(c("Obs", "Chao1", "Chao1_se", "ACE", "ACE_se"), nrow(chao_results_df) / 5)
# save intermediate file
chao_results_df %>%
filter(algorithm %in% c("Chao1", "ACE")) %>%
dplyr::rename(Timepoint_specificity = .id) %>%
write.csv(paste0("../", result.dir, "rsv_repertoire_diversity.csv"), row.names = FALSE)
# diversity mean of x100 replicates per animal
chao_results_df %>%
filter(algorithm %in% c("Chao1", "ACE")) %>%
dplyr::rename(Timepoint_specificity = .id) %>%
group_by(algorithm, Timepoint_specificity) %>%
summarise_all(.funs = mean) %>%
write.csv(paste0("../", result.dir, "rsv_repertoire_diversity_mean.csv"), row.names = FALSE)
chao_results_df <- tidyr::pivot_longer(chao_results_df, cols = 2:(length(chao_results_df) - 1), names_to = c("ID")) %>%
tidyr::separate(.id, c("Timepoint", "Specificity"), sep = "_") %>%
mutate(Group = plyr::mapvalues(
ID, c(
"E11", "E16", "E17", "E23", "E24",
"E12", "E14", "E18", "E21"
),
c(
"20-mer", "20-mer", "20-mer", "20-mer", "20-mer",
"1-mer", "1-mer", "1-mer", "1-mer"
)
))
}
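The `plyr::mapvalues(x, from, to)` calls used throughout recode element-wise; an equivalent base-R lookup with a named vector (mapping excerpted from above) behaves the same way.

```r
# Base-R equivalent of the mapvalues() animal-to-group recoding (excerpt):
group_of <- c(E11 = "20-mer", E16 = "20-mer", E12 = "1-mer", E14 = "1-mer")
unname(group_of[c("E12", "E11", "E16")])  # "1-mer" "20-mer" "20-mer"
```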
Here is the plotting and the comparison between the vaccinated groups:
my_comparisons <- list(c("Total_B1_20-mer", "Total_B1_1-mer"),
c("Total_B2_20-mer", "Total_B2_1-mer"),
c("PreF_B1_20-mer", "PreF_B1_1-mer"),
c("PreF_B2_20-mer", "PreF_B2_1-mer"),
c("DP_B1_20-mer", "DP_B1_1-mer"),
c("DP_B2_20-mer", "DP_B2_1-mer"))
chao_results_df$Specificity[chao_results_df$Specificity == "total"] <- "Total"
chao_results_df <- chao_results_df %>%
filter(algorithm == "Chao1", Timepoint != "PV", Timepoint != "Single-cell") %>%
mutate(Group = plyr::mapvalues(
ID, c(
"E11", "E16", "E17", "E23", "E24",
"E12", "E14", "E18", "E21"
),
c(
"20-mer", "20-mer", "20-mer", "20-mer", "20-mer",
"1-mer", "1-mer", "1-mer", "1-mer"
)
),
Group_specificity = paste(Specificity, Timepoint, Group, sep = "_"),
Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>%
group_by(ID, Group, Timepoint, Specificity, algorithm, Group_specificity) %>%
summarise(value = mean(value)) %>%
tidyr::drop_na() %>%
ungroup()
write.csv(chao_results_df, paste0("../", result.dir, "chao1_results_plot_values.csv"), row.names = FALSE)
stat.test <- chao_results_df %>%
wilcox_test(formula = value ~ Group_specificity,
comparisons = my_comparisons,
p.adjust.method = "fdr",
paired = FALSE) %>%
add_xy_position()
stat.test %>%
kable %>%
kable_styling("striped") %>%
scroll_box(height = "200px")
| .y. | group1 | group2 | n1 | n2 | statistic | p | p.adj | p.adj.signif | y.position | groups | xmin | xmax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| value | Total_B1_20-mer | Total_B1_1-mer | 5 | 4 | 16 | 0.190 | 0.285 | ns | 272.3352 | Total_B1…. | 1 | 2 |
| value | Total_B2_20-mer | Total_B2_1-mer | 5 | 4 | 15 | 0.286 | 0.343 | ns | 301.8278 | Total_B2…. | 3 | 4 |
| value | PreF_B1_20-mer | PreF_B1_1-mer | 5 | 4 | 7 | 0.556 | 0.556 | ns | 331.3205 | PreF_B1_…. | 5 | 6 |
| value | PreF_B2_20-mer | PreF_B2_1-mer | 5 | 4 | 2 | 0.064 | 0.127 | ns | 360.8131 | PreF_B2_…. | 7 | 8 |
| value | DP_B1_20-mer | DP_B1_1-mer | 5 | 4 | 20 | 0.016 | 0.048 | * | 390.3058 | DP_B1_20…. | 9 | 10 |
| value | DP_B2_20-mer | DP_B2_1-mer | 5 | 4 | 20 | 0.016 | 0.048 | * | 419.7984 | DP_B2_20…. | 11 | 12 |
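The adjusted p values in the table come from a Benjamini-Hochberg FDR correction, which base R reproduces from the raw p column via `p.adjust()` (up to rounding, since the raw p values shown in the table are themselves rounded):

```r
# FDR (Benjamini-Hochberg) adjustment as applied by wilcox_test() above,
# reproduced from the raw p values shown in the table:
p_raw <- c(0.190, 0.286, 0.556, 0.064, 0.016, 0.016)
round(p.adjust(p_raw, method = "fdr"), 3)
```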
chao_results_df %>%
ggpubr::ggdotplot(x = "Group_specificity", y = "value",
fill = "Group",
color = "Group",
group = "Group_specificity",
legend = "none", size = 2) +
geom_vline(xintercept = seq(2.5,12,2), linetype = "dotted") +
scale_y_continuous(expand = c(0, 0), limits = c(0, 300)) +
scale_x_discrete(expand = c(0, 0)) +
scale_fill_manual(values = fill_col_values) +
scale_color_manual(values = color_values) +
labs(y = "Species richness\n(Chao1)", x = "") +
stat_summary(fun = mean, geom = "crossbar") +
stat_pvalue_manual(stat.test, label = "p.adj", y.position = 270) +
ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
theme(axis.ticks = element_line(size = .5),
legend.position = "none")
ggsave(paste0("../", result.dir, "chao1_species_richness.pdf"), width = 5, height = 3)
According to the Recon default settings and tutorial (see the
Recon manual), the count files generated previously were used to create
the fitfile.txt files that were used as input for generating
the diversity_table.txt for all the samples.
To generate the fitfile for each count file, the bash script used in a for loop was:
#!/bin/sh
set -euo pipefail
FILE=$1
python recon_v2.5.py -R -t 30 -c -o ${FILE}_fitfile.txt $FILE
python recon_v2.5.py -x --x_max 30 -o ${FILE}_plotfile.txt -b error_bar_parameters.txt ${FILE}_fitfile.txt
Then, each fit file was used for generating the
diversity_table.txt with the command:
python recon_v2.5.py -v -D -b error_bar_parameters.txt -o output_diversity_table.txt *rsv_file_to_recon.txt_fitfile.txt
The results from diversity_table.txt for all the samples
were used as input for plotting.
recon_res <- read.table("../data/diversity_index/recon/rsv_output_diversity_table.txt", header = TRUE) %>%
filter(Timepoint != "PV", Timepoint != "Single-cell", Specificity != "PostF") %>%
mutate(Group = plyr::mapvalues(
ID, c(
"E11", "E16", "E17", "E23", "E24",
"E12", "E14", "E18", "E21"
),
c(
"20-mer", "20-mer", "20-mer", "20-mer", "20-mer",
"1-mer", "1-mer", "1-mer", "1-mer"
)
),
Group_specificity = paste(Specificity, Timepoint, Group, sep = "_")) %>%
mutate(Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>%
ungroup()
write.csv(recon_res, paste0("../", result.dir, "recon_results_plot_values.csv"), row.names = FALSE)
stat.test <- recon_res %>%
wilcox_test(formula = est_0.0D ~ Group_specificity,
comparisons = my_comparisons,
p.adjust.method = "fdr",
paired = FALSE) %>%
add_xy_position()
stat.test %>%
kable %>%
kable_styling("striped") %>%
scroll_box(height = "200px")
| .y. | group1 | group2 | n1 | n2 | statistic | p | p.adj | p.adj.signif | y.position | groups | xmin | xmax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| est_0.0D | Total_B1_20-mer | Total_B1_1-mer | 5 | 4 | 16 | 0.190 | 0.228 | ns | 275.7082 | Total_B1…. | 1 | 2 |
| est_0.0D | Total_B2_20-mer | Total_B2_1-mer | 5 | 4 | 16 | 0.190 | 0.228 | ns | 304.8120 | Total_B2…. | 3 | 4 |
| est_0.0D | PreF_B1_20-mer | PreF_B1_1-mer | 5 | 4 | 7 | 0.556 | 0.556 | ns | 333.9159 | PreF_B1_…. | 5 | 6 |
| est_0.0D | PreF_B2_20-mer | PreF_B2_1-mer | 5 | 4 | 2 | 0.064 | 0.127 | ns | 363.0197 | PreF_B2_…. | 7 | 8 |
| est_0.0D | DP_B1_20-mer | DP_B1_1-mer | 5 | 4 | 20 | 0.016 | 0.048 | * | 392.1236 | DP_B1_20…. | 9 | 10 |
| est_0.0D | DP_B2_20-mer | DP_B2_1-mer | 5 | 4 | 20 | 0.016 | 0.048 | * | 421.2274 | DP_B2_20…. | 11 | 12 |
recon_res %>%
ggpubr::ggdotplot(x = "Group_specificity", y = "est_0.0D",
fill = "Group",
color = "Group",
group = "Group_specificity",
legend = "none", size = 2) +
geom_vline(xintercept = seq(2.5,12,2), linetype = "dotted") +
scale_y_continuous(expand = c(0, 0), limits = c(0, 300)) +
scale_x_discrete(expand = c(0, 0)) +
scale_fill_manual(values = fill_col_values) +
scale_color_manual(values = color_values) +
labs(y = "Species richness\n(Recon)", x = "") +
stat_summary(fun = mean, geom = "crossbar") +
stat_pvalue_manual(stat.test, label = "p.adj", y.position = 270) +
ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
theme(axis.ticks = element_line(size = .5),
legend.position = "none")
ggsave(paste0("../", result.dir, "recon_species_richness.pdf"), width = 5, height = 3)
Simpson's diversity (est_2.0D) from the Recon program was compared in the same way:
stat.test <- recon_res %>%
wilcox_test(formula = est_2.0D ~ Group_specificity,
comparisons = my_comparisons,
p.adjust.method = "fdr",
paired = FALSE) %>%
add_xy_position()
stat.test %>%
kable %>%
kable_styling("striped") %>%
scroll_box(height = "200px")
| .y. | group1 | group2 | n1 | n2 | statistic | p | p.adj | p.adj.signif | y.position | groups | xmin | xmax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| est_2.0D | Total_B1_20-mer | Total_B1_1-mer | 5 | 4 | 16 | 0.190 | 0.570 | ns | 38.90024 | Total_B1…. | 1 | 2 |
| est_2.0D | Total_B2_20-mer | Total_B2_1-mer | 5 | 4 | 14 | 0.413 | 0.619 | ns | 42.92893 | Total_B2…. | 3 | 4 |
| est_2.0D | PreF_B1_20-mer | PreF_B1_1-mer | 5 | 4 | 13 | 0.556 | 0.667 | ns | 46.95762 | PreF_B1_…. | 5 | 6 |
| est_2.0D | PreF_B2_20-mer | PreF_B2_1-mer | 5 | 4 | 12 | 0.730 | 0.730 | ns | 50.98630 | PreF_B2_…. | 7 | 8 |
| est_2.0D | DP_B1_20-mer | DP_B1_1-mer | 5 | 4 | 15 | 0.286 | 0.572 | ns | 55.01499 | DP_B1_20…. | 9 | 10 |
| est_2.0D | DP_B2_20-mer | DP_B2_1-mer | 5 | 4 | 17 | 0.111 | 0.570 | ns | 59.04368 | DP_B2_20…. | 11 | 12 |
recon_res %>%
ggpubr::ggdotplot(x = "Group_specificity", y = "est_2.0D",
fill = "Group",
color = "Group",
group = "Group_specificity",
legend = "none", size = 2) +
geom_vline(xintercept = seq(2.5,12,2), linetype = "dotted") +
scale_y_continuous(expand = c(0, 0), limits = c(0, 45)) +
scale_x_discrete(expand = c(0, 0)) +
scale_fill_manual(values = fill_col_values) +
scale_color_manual(values = color_values) +
labs(y = "Simpson's diversity\n(Recon)", x = "") +
stat_summary(fun = mean, geom = "crossbar") +
stat_pvalue_manual(stat.test, label = "p.adj", y.position = 40) +
ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
theme(axis.ticks = element_line(size = .5),
legend.position = "none")
ggsave(paste0("../", result.dir, "recon_simpson_index.pdf"), width = 5, height = 3)
ls <- list.files("../data/clonotypes", recursive = T, full.names = T)
ls <- ls[grepl("rsv", ls)]
names(ls) <- basename(dirname(ls))
rds_merge <- lapply(ls, readRDS)
rds_merge <- rbindlist(rds_merge, idcol = TRUE, fill = TRUE)
rds_merge <- rds_merge %>%
select(.id, specificity_group, sc_clone_grp, grp, new_name, ID_timepoint, V_SHM, V_errors, CDR3_aa, cdr3_aa, V_gene, J_gene, v_call, j_call) %>%
mutate(
Timepoint = factor(gsub(".*_", "", ID_timepoint), levels = c("PV", "B1", "B2", "Single-cell")),
ID = gsub("_.*", "", ID_timepoint),
database = plyr::mapvalues(.id,
from = c("IMGTrm", "cirelli", "nestor-rm", "individualized"),
to = c("IMGTrm", "Cirelli et al.", "KIMDB", "Individualized")
),
database = factor(database, levels = c("IMGTrm", "Cirelli et al.", "KIMDB", "Individualized")),
Group = plyr::mapvalues(
ID, c(
"E11", "E16", "E17", "E23", "E24",
"E12", "E14", "E18", "E21"
),
c(
"20-mer", "20-mer", "20-mer", "20-mer", "20-mer",
"1-mer", "1-mer", "1-mer", "1-mer"
)
),
cdr3_aa_length = ifelse(is.na(CDR3_aa), nchar(cdr3_aa), nchar(CDR3_aa)),
v_call = ifelse(is.na(v_call), V_gene, v_call),
j_call = ifelse(is.na(j_call), J_gene, j_call)
) %>%
group_by(.id, ID_timepoint) %>%
distinct(new_name, .keep_all = TRUE) %>%
ungroup() %>%
filter(database %in% c("KIMDB", "Individualized"))
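The `ifelse(is.na(x), y, x)` pattern in the mutate above coalesces two gene-call columns: prefer the AIRR-style column, fall back to the legacy one. With toy values (the gene names below are placeholders):

```r
# Coalescing two columns: keep v_call where present, else fall back to V_gene.
v_call <- c(NA, "IGHV3-21*01")
V_gene <- c("IGHV1-2*02", "legacy-value")   # placeholder gene names
ifelse(is.na(v_call), V_gene, v_call)       # "IGHV1-2*02" "IGHV3-21*01"
```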
rds_merge_stat <- rds_merge %>%
group_by(database) %>%
summarise(SHM_mean = mean(V_SHM), SHM_sd = sd(V_SHM), size = n())
rds_merge_stat
## # A tibble: 2 × 4
## database SHM_mean SHM_sd size
## <fct> <dbl> <dbl> <int>
## 1 KIMDB 4.58 3.43 194554
## 2 Individualized 4.68 3.54 238805
fwrite(rds_merge_stat, paste0("../", result.dir, "merged_databases_shm.csv"), row.names = FALSE, sep = ",")
full_rep_indiv <- readRDS("../data/clonotypes/individualized/processed_clonotypes.rds")
full_rep_indiv <- full_rep_indiv %>%
filter(timepoint != "PV") %>%
distinct(new_name, .keep_all = TRUE) %>%
group_by(timepoint, ID) %>%
summarise(aligned_seq = n())
full_rep_indiv %>%
filter(timepoint != "Single-cell") %>%
write.csv(paste0("../", result.dir, "repertoire_depth.csv"),row.names = FALSE)
plot_depth <- full_rep_indiv %>%
filter(timepoint != "Single-cell") %>%
mutate(timepoint = factor(timepoint, levels = c("PV", "B1", "B2"))) %>%
ggplot(aes(y = aligned_seq, x = timepoint, grp = timepoint)) +
geom_boxplot(outlier.shape = NA, lwd=1) +
geom_jitter(aes(color = ID), size = 5) +
ggpubr::stat_compare_means(paired = TRUE, method = "wilcox", label = "p.format", p.adjust.method = "fdr") +
ggpubr::stat_compare_means(method = "kruskal") +
labs(x = "Timepoint", y = "# aligned sequences") +
scale_color_viridis_d() +
ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = 1, axis_text_angle = 45, base_rect_size = 1.5)
plot_depth
ggsave(paste0("../", result.dir, "repertoire_depth.pdf"), width = 5, height = 3)
rds_summary <- rds_merge %>%
group_by(database, ID_timepoint, grp) %>%
summarise(
ID = first(ID), Timepoint = first(Timepoint), Group = first(Group),
clonal_size = n(),
database,
sc_clone_grp = first(sc_clone_grp),
V_errors = mean(V_errors),
specificity_group = first(specificity_group),
cdr3_aa_length = mean(cdr3_aa_length),
v_call = first(v_call),
j_call = first(j_call)
) %>%
ungroup() %>%
group_by(database, ID_timepoint) %>%
mutate(clonal_size_rank = dense_rank(dplyr::desc(clonal_size))) %>%
ungroup() %>%
distinct()
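The `dense_rank(dplyr::desc(clonal_size))` step gives rank 1 to the largest clone, with ties sharing a rank and no ranks skipped; a base-R check on toy sizes:

```r
# Base-R equivalent of dense_rank(desc(x)) on hypothetical clonal sizes:
sizes <- c(50, 10, 50, 3)
match(sizes, sort(unique(sizes), decreasing = TRUE))  # 1 2 1 3
```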
rds_summary_noPV <- rds_summary %>%
filter(Timepoint != "PV", Timepoint != "Single-cell")
rds_summary_save <- rds_summary %>%
filter(Timepoint %in% c("Single-cell", "B1","B2"),
database == "Individualized") %>%
group_by(ID_timepoint) %>%
summarise(mean_clonal_size = mean(clonal_size),
median_clonal_size = median(clonal_size),
geom_clonal_size = exp(mean(log(clonal_size))),
unique_clones = sum(clonal_size == 1),
total_clones_detected = n(),
percentage_unique_clones = round(unique_clones/total_clones_detected*100,digits = 2))
rds_summary_save %>%
kable %>%
kable_styling("striped") %>%
scroll_box(height = "200px")
| ID_timepoint | mean_clonal_size | median_clonal_size | geom_clonal_size | unique_clones | total_clones_detected | percentage_unique_clones |
|---|---|---|---|---|---|---|
| E11_B1 | 73.765766 | 24.0 | 23.618580 | 7 | 111 | 6.31 |
| E11_B2 | 65.842324 | 18.0 | 16.139896 | 22 | 241 | 9.13 |
| E11_Single-cell | 1.596491 | 1.0 | 1.310902 | 169 | 228 | 74.12 |
| E12_B1 | 96.802632 | 8.5 | 11.635387 | 5 | 76 | 6.58 |
| E12_B2 | 62.696296 | 9.0 | 9.733452 | 18 | 135 | 13.33 |
| E12_Single-cell | 1.556000 | 1.0 | 1.304827 | 185 | 250 | 74.00 |
| E14_B1 | 79.458333 | 9.0 | 9.683393 | 11 | 96 | 11.46 |
| E14_B2 | 111.530612 | 10.5 | 12.335661 | 29 | 196 | 14.80 |
| E14_Single-cell | 2.041096 | 1.0 | 1.527244 | 139 | 219 | 63.47 |
| E16_B1 | 73.535519 | 14.0 | 15.223779 | 14 | 183 | 7.65 |
| E16_B2 | 118.491228 | 31.5 | 29.123260 | 17 | 228 | 7.46 |
| E16_Single-cell | 1.996094 | 1.0 | 1.489967 | 169 | 256 | 66.02 |
| E17_B1 | 38.460526 | 8.0 | 9.753009 | 8 | 76 | 10.53 |
| E17_B2 | 55.614035 | 6.0 | 7.902730 | 31 | 171 | 18.13 |
| E17_Single-cell | 1.665158 | 1.0 | 1.381309 | 151 | 221 | 68.33 |
| E18_B1 | 41.323529 | 11.0 | 11.772174 | 9 | 102 | 8.82 |
| E18_B2 | 72.662921 | 21.0 | 15.643111 | 23 | 178 | 12.92 |
| E18_Single-cell | 1.641026 | 1.0 | 1.368241 | 135 | 195 | 69.23 |
| E21_B1 | 35.201835 | 5.0 | 6.887328 | 21 | 109 | 19.27 |
| E21_B2 | 45.202312 | 14.0 | 13.455144 | 16 | 173 | 9.25 |
| E21_Single-cell | 1.674208 | 1.0 | 1.385378 | 151 | 221 | 68.33 |
| E23_B1 | 59.675497 | 9.0 | 11.417886 | 19 | 151 | 12.58 |
| E23_B2 | 63.044025 | 13.0 | 14.180265 | 18 | 159 | 11.32 |
| E23_Single-cell | 1.916279 | 1.0 | 1.484490 | 136 | 215 | 63.26 |
| E24_B1 | 55.469231 | 10.0 | 11.115858 | 18 | 130 | 13.85 |
| E24_B2 | 62.310735 | 9.0 | 10.429075 | 29 | 177 | 16.38 |
| E24_Single-cell | 1.829897 | 1.0 | 1.400170 | 137 | 194 | 70.62 |
write.csv(rds_summary_save,
paste0("../", result.dir, "clone_size_mean_median.csv"),row.names = FALSE)
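The summary above reports a geometric mean alongside the arithmetic mean because clonal sizes are heavily right-skewed: a few very large clones dominate the arithmetic mean, while `exp(mean(log(x)))` reflects the typical clone. A toy example:

```r
# Geometric vs arithmetic mean on skewed, hypothetical clonal sizes:
sizes <- c(1, 1, 2, 4, 200)
mean(sizes)             # pulled up by the single 200-member clone
exp(mean(log(sizes)))   # equals (1 * 1 * 2 * 4 * 200)^(1/5), ~4.4
```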
df_all <- data.frame()
for (timepoints in unique(rds_summary_noPV$Timepoint)) {
for (databases in unique(rds_summary_noPV$database)) {
rds_timepoint <- rds_summary_noPV %>%
filter(Timepoint == timepoints, database == databases) %>%
arrange(desc(clonal_size)) %>%
pull(clonal_size)
rds_indiv <- rds_summary_noPV %>%
filter(
database == "Individualized",
Timepoint == timepoints
) %>%
arrange(desc(clonal_size)) %>%
pull(clonal_size)
length(rds_indiv) <- length(rds_timepoint)
df <- data.frame(
size_indiv = rds_indiv,
clonal_size = rds_timepoint,
Timepoint = timepoints,
database = databases
)
df[is.na(df)] <- 0
df_all <- rbind(df, df_all)
}
}
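The `length(rds_indiv) <- length(rds_timepoint)` assignment in the loop pads the shorter rank-ordered vector with NA so both size vectors align position by position:

```r
# length<- pads a vector with NA up to the new length:
x <- c(5, 3, 1)
length(x) <- 5
x  # 5 3 1 NA NA
```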
df_all %>%
filter(database != "Individualized") %>%
mutate(database = factor(database, levels = c("KIMDB", "Individualized"))) %>%
ggplot(aes(x = size_indiv, y = clonal_size)) +
geom_point(size = 1) +
geom_abline(intercept = 0, slope = 1, color = "black", linetype = "dashed") +
scale_x_continuous(
trans = pseudo_log_trans(base = 10),
breaks = c(1, 10, 100, 1000, 10000),
labels = expression(10^0, 10^1, 10^2, 10^3, 10^4)
) +
scale_y_continuous(
trans = pseudo_log_trans(base = 10),
breaks = c(1, 10, 100, 1000, 10000),
labels = expression(10^0, 10^1, 10^2, 10^3, 10^4)
) +
labs(y = "Clonal size\n (KIMDB)", x = "Clonal size\n(Individualized database)") +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(size = .5)) +
facet_wrap(~ Timepoint)
ggsave(paste0("../", result.dir, "individualized_db_comparison.pdf"), width = 12, height = 6)
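The axes use `pseudo_log_trans()` rather than a plain log10 scale because the NA padding above introduces zeros, which log10 maps to -Inf. The transform is asinh-based, finite at 0, and approximately log10 for large values; a sketch of the formula used by `scales::pseudo_log_trans(base = 10)` with its default sigma of 1:

```r
# Sketch of the pseudo-log transform (sigma = 1, base = 10):
pseudo_log <- function(x, base = 10) asinh(x / 2) / log(base)
pseudo_log(0)     # 0, finite (log10(0) would be -Inf)
pseudo_log(1000)  # ~3, close to log10(1000)
```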
for(i in c("Individualized", "KIMDB")){
rds_summary_noPV <- rds_summary %>%
# check if we want to filter clonotypes by size or not
filter(database == i) %>%
rbind(., within(., specificity_group <- "Total")) %>%
mutate(v_call = sub("\\*.*","", v_call),
j_call = sub("\\*.*|-.*","", j_call),
clonal_size_log = log10(clonal_size),
Group = factor(Group, levels = c("20-mer", "1-mer")),
specificity_group = factor(specificity_group, levels = c("Total", "PreF", "DP", "PostF"))) %>%
group_by(Group) %>%
mutate(clonal_size_perc = (clonal_size/sum(clonal_size)) * 100) %>%
ungroup() %>%
group_by(Group, specificity_group, v_call, j_call) %>%
summarise(clonal_size_perc = sum(clonal_size_perc)) %>%
filter(specificity_group != "PostF") %>% droplevels()
rds_summary_noPV %>%
ggplot(aes(x= v_call, y = j_call, fill = clonal_size_perc)) +
geom_tile(color = "black") +
scale_fill_viridis_c(option = "viridis", direction = 1) +
scale_y_discrete(position = "right") +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.text.x = element_text(angle = 90, size = 8, hjust = 1, vjust = 0.5, face = "bold", colour = "black"),
axis.text.y = element_text(face = "bold", colour = "black", size = 8),
strip.text = element_text(face = "bold"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "top",
axis.ticks = element_line(size = .5)) +
labs(fill = "Sequence count\n(% RSV repertoire\n per group)", x = "IGHV alleles", y = "IGHJ alleles") +
facet_wrap(~ specificity_group + Group, ncol = 1, strip.position = "left")
ggsave(paste0("../", result.dir,i,"_IGHV-IGHJ_pairing-rsv-specific-sequences_perc.pdf"), width = 9, height = 7)
}
lor_mabs <- read.csv("../data/specificity/LOR_single-cells_characterized.csv")
mabs_lors <- read.csv("../data/single_cell/sc_summary_filtered.csv")
mabs_lors <- mabs_lors %>%
filter(name %in% lor_mabs$well_ID) %>%
mutate(mAbs_ID = plyr::mapvalues(name, from = lor_mabs$well_ID, to = lor_mabs$LOR)) %>%
arrange(mAbs_ID) %>%
relocate(mAbs_ID)
mabs_clonal_rank <- rds_summary
for(i in mabs_lors$name) {
mabs_clonal_rank[[i]] <- ifelse(grepl(i, rds_summary$sc_clone_grp),
mabs_lors[mabs_lors$name == i,]$mAbs_ID,
NA)
}
col_combine <- colnames(mabs_clonal_rank[15:length(mabs_clonal_rank)])
mabs_clonal_rank <- mabs_clonal_rank %>%
mutate(LOR = purrr::invoke(coalesce, across(all_of(col_combine)))) %>%
select(LOR, colnames(mabs_clonal_rank)[! colnames(mabs_clonal_rank) %in% col_combine]) %>%
filter(!is.na(LOR), database == "Individualized", Timepoint == "B2") %>%
arrange(LOR) %>%
mutate(well_ID = plyr::mapvalues(LOR, from = lor_mabs$LOR, to = substr(lor_mabs$well_ID, 1,3))) %>%
filter(ID == well_ID)
mabs_lors <- mabs_lors %>%
mutate(clonal_rank_B2 = plyr::mapvalues(mAbs_ID, from = mabs_clonal_rank$LOR, to = mabs_clonal_rank$clonal_size_rank),
clonal_size_B2 = plyr::mapvalues(mAbs_ID, from = mabs_clonal_rank$LOR, to = mabs_clonal_rank$clonal_size),
clonal_group = plyr::mapvalues(mAbs_ID, from = mabs_clonal_rank$LOR, to = mabs_clonal_rank$grp)) %>%
relocate(mAbs_ID, clonal_rank_B2, clonal_size_B2, clonal_group)
# Fixed manually: LOR24 (same clone as LOR19), LOR37 (not detected at B2), LOR40 (not detected at B2), LOR42 (same clone as LOR01)
mabs_lors %>% write.csv(paste0("../", result.dir,"LOR_mAbs_info.csv"), row.names = F)
mabs_lors <- read.csv(paste0("../results/2022-09-23/LOR_mAbs_info.csv"))
mabs_lors %>%
ggplot(aes(x = clonal_rank_B2, y = clonal_size_B2, color = substr(name,1,3))) +
geom_point(size = 3) +
scale_color_viridis_d() +
scale_y_log10() +
scale_x_log10() +
labs(color = "Animal ID", x = "Clonal rank\n(log10)", y = "Clonal size\n(log10)") +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(size = .5))
ggsave(paste0("../", result.dir, "clonal_size_vs_clonal_rank.pdf"), width = 5, height = 3)
rds_summary_noPV <- rds_summary %>%
# If clonotypes should be filtered by clonal size (e.g. removing singletons), do it here.
filter(database == "Individualized", Timepoint != "Single-cell", Timepoint != "PV") %>%
rbind(., within(., specificity_group <- "Total")) %>%
filter(specificity_group != "PostF") %>%
mutate(Group_specificity = paste(specificity_group, Timepoint, Group, sep = "_"),
Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>%
group_by(ID_timepoint, specificity_group, Group_specificity, Group) %>%
mutate(v_j_calls = paste(v_call,j_call, sep = "_")) %>%
distinct(v_j_calls) %>%
summarise(unique_v_j = n()) %>%
ungroup()
rds_summary_noPV %>% write.csv(paste0("../", result.dir,"unique_HV_HJ_pairing-data.csv"), row.names = F)
stat.test <- rds_summary_noPV %>%
wilcox_test(formula = unique_v_j ~ Group_specificity,
comparisons = my_comparisons,
p.adjust.method = "fdr",
paired = FALSE) %>%
add_xy_position()
stat.test %>%
kable %>%
kable_styling("striped") %>%
scroll_box(height = "200px")
| .y. | group1 | group2 | n1 | n2 | statistic | p | p.adj | p.adj.signif | y.position | groups | xmin | xmax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| unique_v_j | Total_B1_20-mer | Total_B1_1-mer | 5 | 4 | 14 | 0.413 | 0.496 | ns | 137.040 | Total_B1…. | 1 | 2 |
| unique_v_j | Total_B2_20-mer | Total_B2_1-mer | 5 | 4 | 11 | 0.901 | 0.901 | ns | 150.288 | Total_B2…. | 3 | 4 |
| unique_v_j | PreF_B1_20-mer | PreF_B1_1-mer | 5 | 4 | 4 | 0.176 | 0.264 | ns | 163.536 | PreF_B1_…. | 5 | 6 |
| unique_v_j | PreF_B2_20-mer | PreF_B2_1-mer | 5 | 4 | 0 | 0.016 | 0.048 | \* | 176.784 | PreF_B2_…. | 7 | 8 |
| unique_v_j | DP_B1_20-mer | DP_B1_1-mer | 5 | 4 | 20 | 0.016 | 0.048 | \* | 190.032 | DP_B1_20…. | 9 | 10 |
| unique_v_j | DP_B2_20-mer | DP_B2_1-mer | 5 | 4 | 18 | 0.064 | 0.127 | ns | 203.280 | DP_B2_20…. | 11 | 12 |
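The table above comes from pairwise two-sided Wilcoxon rank-sum tests with FDR correction via `rstatix::wilcox_test`. A minimal sketch of that workflow on invented data (the groups and counts below are made up and are not from the paper; with a multi-level grouping factor, `wilcox_test` runs all pairwise comparisons and adjusts them together):

```r
library(rstatix)
library(dplyr)

# toy data: unique V-J pair counts per animal, three hypothetical groups
toy <- data.frame(
  group = rep(c("20-mer", "10-mer", "1-mer"), each = 4),
  unique_v_j = c(140, 120, 150, 135, 90, 100, 95, 110, 60, 70, 65, 80)
)

# pairwise unpaired Wilcoxon tests, p values adjusted by FDR
toy %>%
  wilcox_test(unique_v_j ~ group, p.adjust.method = "fdr", paired = FALSE)
```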
rds_summary_noPV %>%
ggpubr::ggdotplot(x = "Group_specificity", y = "unique_v_j",
fill = "Group",
color = "Group",
group = "Group_specificity",
legend = "none", size = 2) +
geom_vline(xintercept = seq(2.5,12,2), linetype = "dotted") +
scale_y_continuous(expand = c(0, 0), limits = c(0, 200)) +
scale_x_discrete(expand = c(0, 0)) +
scale_fill_manual(values = fill_col_values) +
scale_color_manual(values = color_values) +
labs(y = "HV and HJ unique pairs\n(Count)", x = "") +
stat_summary(fun = mean,
geom = "crossbar") +
stat_pvalue_manual(stat.test, label = "p.adj", y.position = 150) +
ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
theme(axis.ticks = element_line(size = .5),
legend.position = "none")
ggsave(paste0("../", result.dir, "v-j-pair_count.pdf"), width = 5, height = 3)
rds_summary_noPV <- rds_summary %>%
# decide if clonal size greater than 1 should be included
filter(database == "Individualized", Timepoint != "Single-cell") %>%
rbind(., within(., specificity_group <- "Total")) %>%
filter(specificity_group != "PostF") %>%
mutate(Group_specificity = paste(specificity_group, Timepoint, Group, sep = "_"),
Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>%
group_by(Group) %>%
mutate(v_j_calls = paste(v_call,j_call, sep = "_")) %>%
distinct(v_j_calls)
list_v_j <- list("20-mer" = rds_summary_noPV$v_j_calls[rds_summary_noPV$Group == "20-mer"],
"1-mer" = rds_summary_noPV$v_j_calls[rds_summary_noPV$Group == "1-mer"])
ggVennDiagram::ggVennDiagram(list_v_j, label_size = 7,
label_alpha = 0,
set_color = "black",
set_size = 9) +
scale_fill_gradient(low = "#F4FAFE", high = "#4981BF") +
scale_color_manual(values = c("#5F90B0", "#D896C1")) +
theme(legend.position = "none")
ggsave(paste0("../", result.dir,"shared-unique-v-j_pairing.pdf"), height = 5, width = 5)
rds_indiv <- rds_summary %>%
filter(
database == "Individualized",
Timepoint != "Single-cell",
Timepoint != "PV"
) %>%
mutate(
LOR = ifelse(grepl(pattern = paste0(lor_mabs$well_ID, collapse = "|"), x = sc_clone_grp), "cloned", "not_cloned"),
LOR = factor(LOR, levels = c("cloned", "not_cloned"), ordered = TRUE)
)
rds_indiv$Timepoint <- droplevels(rds_indiv$Timepoint)
all <- rds_indiv %>%
tidyr::expand(ID, Timepoint, grp) %>%
filter(Timepoint != "Single-cell")
rds_indiv <- rds_indiv %>%
right_join(all) %>%
mutate(
ID_timepoint = paste(ID, Timepoint, sep = "_"),
database = "individualized",
clonal_size = ifelse(is.na(clonal_size), 0, clonal_size)
) %>%
arrange(grp, ID_timepoint) %>%
tidyr::fill(LOR, Group, sc_clone_grp)
NP_cum_test <- rds_indiv %>%
group_by(Group, Timepoint, specificity_group) %>%
summarise(clonal_size = sum(clonal_size)) %>%
ungroup() %>%
tidyr::pivot_wider(values_from = clonal_size, names_from = Timepoint) %>%
filter(Group == "20-mer") %>%
select(-Group) %>% as.matrix()
rownames(NP_cum_test) <- NP_cum_test[,1]
NP_cum_test <- NP_cum_test[-c(2,4),]
NP_cum_test <- apply(NP_cum_test[,-1], FUN = as.numeric, MARGIN = c(1,2))
sol_cum_test <- rds_indiv %>%
group_by(Group, Timepoint, specificity_group) %>%
summarise(clonal_size = sum(clonal_size)) %>%
ungroup() %>%
tidyr::pivot_wider(values_from = clonal_size, names_from = Timepoint) %>%
filter(Group == "1-mer") %>%
select(-Group) %>% as.matrix()
rownames(sol_cum_test) <- sol_cum_test[,1]
sol_cum_test <- sol_cum_test[-c(2,4),]
sol_cum_test <- apply(sol_cum_test[,-1], FUN = as.numeric, MARGIN = c(1,2))
## Chi-square test for nanoparticle group
chisq.test(NP_cum_test, correct = FALSE)
##
## Pearson's Chi-squared test
##
## data: NP_cum_test
## X-squared = 73.484, df = 1, p-value < 2.2e-16
## Chi-square test for soluble group
chisq.test(sol_cum_test, correct = FALSE)
##
## Pearson's Chi-squared test
##
## data: sol_cum_test
## X-squared = 726.86, df = 1, p-value < 2.2e-16
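Each test above is run on a 2×2 contingency table of cumulative clonal sizes (specificity group × boost timepoint), asking whether the PreF/DP composition differs between B1 and B2 within a vaccine group. A minimal sketch with invented counts (the numbers below are for illustration only):

```r
# Invented counts: rows = specificity (DP, PreF), columns = timepoint (B1, B2)
toy_tab <- matrix(c(120, 300,
                    450, 900),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("DP", "PreF"), c("B1", "B2")))

# Pearson's chi-squared test without Yates' continuity correction,
# matching the correct = FALSE calls above
chisq.test(toy_tab, correct = FALSE)
```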
rds_indiv %>%
group_by(Group, Timepoint, specificity_group) %>%
summarise(clonal_size = sum(clonal_size)) %>%
tidyr::drop_na() %>%
filter(specificity_group != "PostF") %>%
ggplot(aes(x = Timepoint, y = clonal_size, fill = specificity_group, group = specificity_group)) +
geom_area(color = "black") +
scale_y_continuous(expand = c(0, 0), limits = c(0, 80000)) +
scale_x_discrete(expand = c(0.02, 0.02)) +
scale_fill_manual(values = c("#F3DDF8", "#FAE1D6")) +
labs(y = "Clonal size") +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(size = .5)) +
facet_wrap(~Group)
ggsave(paste0("../", result.dir, "rsv_specific_clonal_size_area_plot_spec.pdf"), width = 8, height = 2)
Data are divided by vaccine group (20-mer, 1-mer), timepoint, and antigen specificity.
rds_indiv_total <- rds_indiv %>%
mutate(specificity_group = "Total")
rds_indiv <- rbind(rds_indiv, rds_indiv_total)
rds_indiv_plot <- rds_indiv %>%
filter(specificity_group != "PostF") %>%
mutate(Group_specificity = paste(specificity_group, Timepoint, Group, sep = "_")) %>%
mutate(Group_specificity = factor(Group_specificity, levels = c("Total_B1_20-mer", "Total_B1_1-mer", "Total_B2_20-mer", "Total_B2_1-mer", "PreF_B1_20-mer", "PreF_B1_1-mer", "PreF_B2_20-mer", "PreF_B2_1-mer", "DP_B1_20-mer", "DP_B1_1-mer", "DP_B2_20-mer", "DP_B2_1-mer"))) %>% ungroup()
stat.test <- rds_indiv_plot %>%
wilcox_test(formula = V_errors ~ Group_specificity,
comparisons = my_comparisons,
p.adjust.method = "fdr",
paired = FALSE) %>%
add_xy_position()
stat.test %>%
kable %>%
kable_styling("striped") %>%
scroll_box(height = "200px")
| .y. | group1 | group2 | n1 | n2 | statistic | p | p.adj | p.adj.signif | y.position | groups | xmin | xmax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| V_errors | Total_B1_20-mer | Total_B1_1-mer | 651 | 383 | 119282.5 | 0.246 | 0.492 | ns | 68.320 | Total_B1…. | 1 | 2 |
| V_errors | Total_B2_20-mer | Total_B2_1-mer | 976 | 682 | 349160.0 | 0.088 | 0.265 | ns | 73.504 | Total_B2…. | 3 | 4 |
| V_errors | PreF_B1_20-mer | PreF_B1_1-mer | 205 | 205 | 21279.5 | 0.824 | 0.841 | ns | 78.688 | PreF_B1_…. | 5 | 6 |
| V_errors | PreF_B2_20-mer | PreF_B2_1-mer | 346 | 389 | 69762.5 | 0.391 | 0.587 | ns | 83.872 | PreF_B2_…. | 7 | 8 |
| V_errors | DP_B1_20-mer | DP_B1_1-mer | 413 | 144 | 29401.0 | 0.841 | 0.841 | ns | 89.056 | DP_B1_20…. | 9 | 10 |
| V_errors | DP_B2_20-mer | DP_B2_1-mer | 583 | 250 | 79353.0 | 0.042 | 0.251 | ns | 94.240 | DP_B2_20…. | 11 | 12 |
rds_indiv_plot %>%
ggpubr::ggviolin(x = "Group_specificity", y = "V_errors", fill = "Group_specificity", group = "Group_specificity",
legend = "none") +
geom_boxplot(outlier.shape = NA, width = 0.15, color = "black", alpha = 0.2)+
geom_vline(xintercept = c(4.5, 8.5), linetype = "dotted") +
scale_y_continuous(expand = c(0, 0), limits = c(0, 80)) +
scale_shape_manual(values=NA)+
scale_x_discrete(expand = c(0, 0)) +
scale_fill_manual(values = rep(c("#5F90B0", "#D896C1"), 6)) +
labs(y = "# IGHV nucleotide mutations", x= "") +
stat_pvalue_manual(stat.test, label = "p.adj", y.position = 70) +
ggprism::theme_prism(base_fontface = "plain", border = T, base_line_size = .5, axis_text_angle = 45) +
theme(axis.ticks = element_line(size = .5),
legend.position = "none")
ggsave(paste0("../", result.dir, "rsv_specific_SHM_per_group.pdf"), width = 6, height = 3)
rds_indiv %>%
filter(specificity_group != "PostF") %>%
mutate(
Group_specificity = paste(Group, specificity_group, sep = "_"),
specificity_group = factor(specificity_group, levels = c("Total", "PreF", "DP")),
Timepoint = factor(Timepoint, levels = c("B1", "B2"))
) %>%
ggplot(aes(x = cdr3_aa_length, fill = Group)) +
geom_density(alpha = .7) +
scale_y_continuous(expand = c(0, 0)) +
scale_x_continuous(expand = c(0, 0)) +
scale_fill_manual(values = c("#5F90B0", "#D896C1")) +
labs(y = "Density", x = "HCDR3 aa length") +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(size = .5)) +
facet_wrap(~ specificity_group + Timepoint, ncol = 3, dir = "v")
ggsave(paste0("../", result.dir, "rsv_specific_CDR3_length.pdf"), width = 6, height = 3)
kimdb <- Biostrings::readDNAStringSet("../data/databases/nestor_rm/V.fasta")
joined_dbs <- c(kimdb, individualized_dbs_sel)
uniq_joined_dbs <- unique(joined_dbs)
uniq_joined_dbs[grepl("\\.", names(uniq_joined_dbs))]
## DNAStringSet object of length 15:
## width seq names
## [1] 296 CAGGTCCAGCTGGTGCAATCCGG...CCGTGTATTACTGTGCAAGAGA E12.IGHV1-NL_1*01...
## [2] 293 GTGGAGCAGCTGGTGGAGTCTGG...CTGTGTATTATTGTGCAAGAGA E12.IGHV3-100*01
## [3] 293 GTGGAGCAGCTGGTGGAGTCTGG...CTGTGTATTATTGTGCAAGAGA E14.IGHV3-100*01_...
## [4] 293 GTGGAGCAGCTGGTGGAGTCTGG...CTGTGTATTATTGTGCAAGAGA E14.IGHV3-100*01_...
## [5] 298 GAAGTGCAGCTGGTGGAGTCTGG...TTGTATTACTGTAGTAGAGAGA E14.IGHV3-NL_15*0...
## ... ... ...
## [11] 296 GAGGTGCAGCTGGCGGAGTCTGG...CCGTGTATTACTGTGCGAGAGA E16.IGHV3-87*02
## [12] 299 CAGGTACAGCTGAAGGAGTCAGG...CCGTGTATTACTGTGCGAGACA E16.IGHV4-NL_23*0...
## [13] 296 CAGGTGCAGCTGCAGGAGTCGGG...CCGTGTATTACTGTGCGAGAGA E18.IGHV4-NL_5*01...
## [14] 299 CAGGTGCAGCTGCAGGAGTCGGG...CCGTGTATTACTGTGCGAGACA E18.IGHV4-NL_33*0...
## [15] 297 GAAGTGCAGCTGGTGGAGTCTGG...CTTGTATTACTGTGCAAAGATA E21.IGHV3-141*01
size_uniq_kimdb <- data.frame(
v_unique = table(substr(names(uniq_joined_dbs[grepl("\\.", names(uniq_joined_dbs))]), 1, 3)),
type = "Not validated"
) %>%
dplyr::rename(
v_count = v_unique.Freq,
ID = v_unique.Var1
)
size_indv_db$type <- "KIMDB"
kimdb_v_indiv <- rbind(size_indv_db, size_uniq_kimdb) %>%
add_row(ID = c("E11", "E17", "E23", "E24"), v_count = rep(0), type = rep("Not validated")) %>%
group_by(ID) %>%
arrange(ID, .by_group = TRUE) %>%
mutate(
diff = v_count - lag(v_count, default = last(v_count)),
diff = ifelse(diff < 0, v_count, diff)
)
kimdb_v_indiv %>%
write.csv(paste0("../",result.dir, "alleles_validated_KIMDB.csv"), row.names = FALSE)
kimdb_v_indiv %>%
mutate(type = factor(type, levels = c("Not validated", "KIMDB"))) %>%
ggplot(aes(x = ID, y = diff, fill = type)) +
geom_col(color = "black") +
scale_fill_viridis_d(direction = -1) +
scale_y_continuous(expand = c(0, 0), limits = c(0, 100)) +
scale_x_discrete(expand = c(0, 0)) +
labs(y = "V allele counts", x = "Animal ID") +
theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(size = .5))
ggsave(paste0("../", result.dir, "indiv_validated_KIMDB_alleles.pdf"), width = 8, height = 6.38)
clonal_tree_data <- clonotype_rsv %>%
ungroup() %>%
distinct(new_name, .keep_all = TRUE) %>%
group_by(specificity_group, sc_clone_grp, grp, ID_timepoint) %>%
add_tally(name = "clonal_size") %>%
ungroup()
mabs_of_interests <- c("LOR21" = "E16_05_A08", "LOR24" = "E16_05_H03")
# read data from heavy_chain
V_genes <- readDNAStringSet("../data/databases/individualized/combined/V.fasta")
J_genes <- readDNAStringSet("../data/databases/individualized/combined/J.fasta")
HC_all_filtered <- clonotype_rsv %>% filter(timepoint == "Single-cell")
# read data from light chain
LV_genes <- readDNAStringSet("../data/databases/cirelli_LC/V.fasta")
LJ_genes <- readDNAStringSet("../data/databases/cirelli_LC/J.fasta")
clono_light_chains <- read.table("../data/clonotypes/light_chain_assigned_clonotypes.tsv", sep = "\t", header = TRUE)
sc_seq_count <- list()
for(i in seq_along(mabs_of_interests)){
print(i)
mab <- mabs_of_interests[i]
if(any(stringr::str_count(names(mab), "LOR") <= 1)){
mab <- mabs_of_interests[i]
mab_name <- names(mab)
}
if(any(stringr::str_count(names(mab), "LOR") > 1)){
mab <- c(mabs_of_interests[i], mabs_of_interests[i+1])
mab_name <- names(mab)}
# create UCA heavy chain
UCA_V <- V_genes[HC_all_filtered$V_gene[HC_all_filtered$name %in% mab]]
UCA_J <- J_genes[HC_all_filtered$J_gene[HC_all_filtered$name %in% mab]]
UCA <- DNAStringSet(paste0(UCA_V,UCA_J))
names(UCA) <- paste0(mab_name,"_UCA")
# select sequences part of the same clonotype
group <- clonal_tree_data %>% filter(stringr::str_detect(sc_clone_grp, mab)) %>% select(grp) %>% unique() %>% pull()
filtered <- clonal_tree_data %>%
filter(grp == group)
# save csv with B cell lineage info
if(any(stringr::str_count(names(mab), "LOR") > 1)){
write.csv(filtered, paste0("../",result.dir,mab_name[1],".csv"),
row.names = FALSE)
}else{
write.csv(filtered, paste0("../",result.dir,mab_name,".csv"),
row.names = FALSE)
}
# deduplicating sequences
fasta <- unique(df_to_fasta(sequence_strings = filtered$VDJ_nt, sequence_name = gsub(":", "_",filtered$new_name), save_fasta = FALSE))
#adding single cell sequences
sc_seqs <- filtered[!duplicated(filtered[,c('sc_clone_grp','VDJ_nt')]),]
#saving duplicated
if(any(stringr::str_count(names(mab), "LOR") > 1)){
sc_seq_count[[mab_name[1]]] <- filtered %>% group_by(VDJ_nt) %>% summarise(duplicated = n(), name = gsub(":", "_",name))
}else{
sc_seq_count[[mab_name]] <- filtered %>% group_by(VDJ_nt) %>% summarise(duplicated = n(), name = gsub(":", "_",name))
}
sc_seqs <- df_to_fasta(sequence_strings = sc_seqs$VDJ_nt, sequence_name = gsub(":", "_",sc_seqs$new_name), save_fasta = FALSE)
fasta <- c(fasta, sc_seqs, UCA)
fasta <- fasta[unique(names(fasta))]
if(any(stringr::str_count(names(mab), "LOR") > 1)){
writeXStringSet(fasta, filepath = paste0("../",result.dir, mab_name[[1]], ".fasta"), append = FALSE, format = "fasta")
}else{
writeXStringSet(fasta, filepath = paste0("../",result.dir, mab_name, ".fasta"), append = FALSE, format = "fasta") }
}
## [1] 1
## [1] 2
for(i in seq_along(mabs_of_interests)){
print(i)
mab <- mabs_of_interests[i]
if(any(stringr::str_count(names(mab), "LOR") <= 1)){
mab <- mabs_of_interests[i]
mab_name <- names(mab)
}
if(any(stringr::str_count(names(mab), "LOR") > 1)){
mab <- c(mabs_of_interests[i], mabs_of_interests[i+1])
mab_name <- names(mab)}
# create UCA light chain
UCA_LV <- LV_genes[clono_light_chains$v_call[clono_light_chains$name %in% mab]]
UCA_LJ <- LJ_genes[clono_light_chains$j_call[clono_light_chains$name %in% mab]]
UCA_LC <- DNAStringSet(paste0(UCA_LV,UCA_LJ))
names(UCA_LC) <- paste0(mab_name,"_LC_UCA")
# select light chain clonotypes
group <- clono_light_chains %>% filter(stringr::str_detect(name, mab)) %>% select(grp) %>% unique() %>% pull()
filtered <- clono_light_chains %>%
filter(grp == group, substr(name,1,3) == substr(mab,1,3))
# save csv with B cell lineage info
if(any(stringr::str_count(names(mab), "LOR") > 1)){
write.csv(filtered, paste0("../",result.dir, mab_name[1],"_LC", ".csv"),
row.names = FALSE)
}else{
write.csv(filtered, paste0("../",result.dir,mab_name,"_LC",".csv"),
row.names = FALSE)
}
# save object as BStringset
fasta <- df_to_fasta(sequence_strings = filtered$VDJ_nt, sequence_name = gsub(":", "_",filtered$new_name), save_fasta = FALSE)
# save duplicated values
if(any(stringr::str_count(names(mab), "LOR") > 1)){
sc_seq_count[[paste0(mab_name[1],"_LC")]] <- filtered %>% group_by(VDJ_nt) %>% summarise(duplicated = n(), name = gsub(":", "_",name))
}else{
sc_seq_count[[paste0(mab_name,"_LC")]] <- filtered %>% group_by(VDJ_nt) %>% summarise(duplicated = n(), name = gsub(":", "_",name))
}
fasta <- c(fasta, UCA_LC)
fasta <- fasta[unique(names(fasta))]
if(any(stringr::str_count(names(mab), "LOR") > 1)){
writeXStringSet(fasta, filepath = paste0("../",result.dir, mab_name[[1]],"_LC", ".fasta"), append = FALSE, format = "fasta")
}else{
writeXStringSet(fasta, filepath = paste0("../",result.dir, mab_name, "_LC", ".fasta"), append = FALSE, format = "fasta") }
}
## [1] 1
## [1] 2
Run a Multiple Sequence Alignment using MUSCLE (v 5.1)
on all the fasta files generated throughout this analysis. Then
run FastTree (v 2.1.11) on all the aligned sequences
and save the tree output to be plotted by the following code.
source ~/.bash_profile
DIR_DATE=$(date +'%Y-%m-%d')
cd ../results/$DIR_DATE/
for f in *.fasta; do muscle -align $f -output $f.aln; done
for f in *.aln; do fasttree -nt -gtr $f > $f.tre; done
##
## muscle 5.1.osx64 [] 34.4Gb RAM, 8 cores
## Built Feb 22 2022 02:38:35
## (C) Copyright 2004-2021 Robert C. Edgar.
## https://drive5.com
##
## Input: 217 seqs, avg length 364, max 370
##
## 00:00 4.4Mb CPU has 8 cores, running 8 threads
## 00:00 4.6Mb 0.0043% Calc posteriors
## 01:38 961Mb 100.0% Calc posteriors
## 01:45 1.1Gb 100.0% Consistency (1/2)
## 01:52 1.1Gb 100.0% Consistency (2/2)
## 01:52 1.1Gb 100.0% UPGMA5
## 01:55 1.1Gb 100.0% Refining
##
## muscle 5.1.osx64 [] 34.4Gb RAM, 8 cores
## Built Feb 22 2022 02:38:35
## (C) Copyright 2004-2021 Robert C. Edgar.
## https://drive5.com
##
## Input: 8 seqs, avg length 322, max 326
##
## 00:00 2.1Mb CPU has 8 cores, running 8 threads
## 00:00 77Mb 100.0% Calc posteriors
## 00:00 78Mb 100.0% Consistency (1/2)
## 00:00 78Mb 100.0% Consistency (2/2)
## 00:00 78Mb 100.0% UPGMA5
## 00:00 78Mb 100.0% Refining
##
## muscle 5.1.osx64 [] 34.4Gb RAM, 8 cores
## Built Feb 22 2022 02:38:35
## (C) Copyright 2004-2021 Robert C. Edgar.
## https://drive5.com
##
## Input: 118 seqs, avg length 370, max 370
##
## 00:00 2.8Mb CPU has 8 cores, running 8 threads
## 00:33 942Mb 100.0% Calc posteriors
## 00:34 994Mb 100.0% Consistency (1/2)
## 00:35 998Mb 100.0% Consistency (2/2)
## 00:35 998Mb 100.0% UPGMA5
## 00:36 1.0Gb 100.0% Refining
##
## muscle 5.1.osx64 [] 34.4Gb RAM, 8 cores
## Built Feb 22 2022 02:38:35
## (C) Copyright 2004-2021 Robert C. Edgar.
## https://drive5.com
##
## Input: 10 seqs, avg length 331, max 334
##
## 00:00 2.1Mb CPU has 8 cores, running 8 threads
## 00:00 88Mb 100.0% Calc posteriors
## 00:00 88Mb 100.0% Consistency (1/2)
## 00:00 88Mb 100.0% Consistency (2/2)
## 00:00 88Mb 100.0% UPGMA5
## 00:00 89Mb 100.0% Refining
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR21.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.05 seconds
## Refining topology: 31 rounds ME-NNIs, 2 rounds ME-SPRs, 16 rounds ML-NNIs
## 0.10 seconds: SPR round 1 of 2, 101 of 432 nodes
## 0.20 seconds: SPR round 1 of 2, 401 of 432 nodes
## 0.32 seconds: SPR round 2 of 2, 301 of 432 nodes
## Total branch-length 2.197 after 0.38 sec
## 0.45 seconds: ML NNI round 1 of 16, 101 of 215 splits, 27 changes (max delta 1.450)
## ML-NNI round 1: LogLk = -5720.973 NNIs 57 max delta 7.09 Time 0.51
## 0.58 seconds: Optimizing GTR model, step 3 of 12
## 0.68 seconds: Optimizing GTR model, step 6 of 12
## 0.78 seconds: Optimizing GTR model, step 10 of 12
## GTR Frequencies: 0.2176 0.2444 0.3116 0.2264
## GTR rates(ac ag at cg ct gt) 1.2448 1.4803 0.7188 0.5852 1.4823 1.0000
## 0.90 seconds: ML Lengths 201 of 215 splits
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.816 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
## 1.03 seconds: ML NNI round 2 of 16, 101 of 215 splits, 38 changes (max delta 0.001)
## ML-NNI round 2: LogLk = -5299.576 NNIs 67 max delta 0.00 Time 1.11
## Turning off heuristics for final round of ML NNIs (converged)
## 1.21 seconds: ML NNI round 3 of 16, 101 of 215 splits, 40 changes (max delta 0.000)
## ML-NNI round 3: LogLk = -5298.548 NNIs 61 max delta 1.02 Time 1.32 (final)
## 1.32 seconds: ML Lengths 1 of 215 splits
## Optimize all lengths: LogLk = -5298.540 Time 1.37
## 1.48 seconds: ML split tests for 100 of 214 internal splits
## 1.59 seconds: ML split tests for 200 of 214 internal splits
## Total time: 1.61 seconds Unique: 217/217 Bad splits: 0/214
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR21_LC.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.00 seconds
## Refining topology: 10 rounds ME-NNIs, 2 rounds ME-SPRs, 5 rounds ML-NNIs
## Total branch-length 0.080 after 0.00 sec
## ML-NNI round 1: LogLk = -612.010 NNIs 1 max delta 0.13 Time 0.00
## GTR Frequencies: 0.2670 0.2510 0.2459 0.2361
## GTR rates(ac ag at cg ct gt) 0.1273 14.1848 2.8987 3.9092 3.0704 1.0000
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.642 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
## ML-NNI round 2: LogLk = -587.982 NNIs 0 max delta 0.00 Time 0.01
## Turning off heuristics for final round of ML NNIs (converged)
## ML-NNI round 3: LogLk = -587.981 NNIs 0 max delta 0.00 Time 0.02 (final)
## Optimize all lengths: LogLk = -587.981 Time 0.02
## Total time: 0.02 seconds Unique: 6/8 Bad splits: 0/3
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR24.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.02 seconds
## Refining topology: 28 rounds ME-NNIs, 2 rounds ME-SPRs, 14 rounds ML-NNIs
## 0.10 seconds: ME NNI round 10 of 28, 1 of 116 splits
## Total branch-length 1.596 after 0.18 sec
## 0.27 seconds: ML NNI round 1 of 14, 101 of 116 splits, 29 changes (max delta 5.999)
## ML-NNI round 1: LogLk = -4057.963 NNIs 35 max delta 6.00 Time 0.29
## 0.38 seconds: Optimizing GTR model, step 7 of 12
## GTR Frequencies: 0.2038 0.2859 0.3002 0.2101
## GTR rates(ac ag at cg ct gt) 1.0761 0.9774 0.5841 0.4792 1.2071 1.0000
## 0.49 seconds: ML Lengths 101 of 116 splits
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.789 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
## 0.59 seconds: ML NNI round 2 of 14, 101 of 116 splits, 15 changes (max delta 0.290)
## ML-NNI round 2: LogLk = -3817.538 NNIs 18 max delta 1.05 Time 0.61
## ML-NNI round 3: LogLk = -3814.285 NNIs 1 max delta 3.25 Time 0.65
## ML-NNI round 4: LogLk = -3814.285 NNIs 0 max delta 0.00 Time 0.68
## Turning off heuristics for final round of ML NNIs (converged)
## 0.76 seconds: ML NNI round 5 of 14, 101 of 116 splits, 0 changes
## ML-NNI round 5: LogLk = -3814.284 NNIs 0 max delta 0.00 Time 0.77 (final)
## Optimize all lengths: LogLk = -3814.284 Time 0.80
## 0.91 seconds: ML split tests for 100 of 115 internal splits
## Total time: 0.93 seconds Unique: 118/118 Bad splits: 0/115
## FastTree Version 2.1.11 Double precision (No SSE3)
## Alignment: LOR24_LC.fasta.aln
## Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
## Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
## TopHits: 1.00*sqrtN close=default refresh=0.80
## ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
## Initial topology in 0.00 seconds
## Refining topology: 13 rounds ME-NNIs, 2 rounds ME-SPRs, 6 rounds ML-NNIs
## Total branch-length 0.104 after 0.00 sec
## ML-NNI round 1: LogLk = -699.140 NNIs 1 max delta 0.00 Time 0.01
## GTR Frequencies: 0.2026 0.3061 0.2592 0.2321
## GTR rates(ac ag at cg ct gt) 0.7128 4.0992 1.9319 0.9442 0.6308 1.0000
## Switched to using 20 rate categories (CAT approximation)
## Rate categories were divided by 0.648 so that average rate = 1.0
## CAT-based log-likelihoods may not be comparable across runs
## Use -gamma for approximate but comparable Gamma(20) log-likelihoods
## ML-NNI round 2: LogLk = -675.314 NNIs 1 max delta 0.00 Time 0.02
## Turning off heuristics for final round of ML NNIs (converged)
## ML-NNI round 3: LogLk = -675.314 NNIs 0 max delta 0.00 Time 0.03 (final)
## Optimize all lengths: LogLk = -675.314 Time 0.03
## Total time: 0.04 seconds Unique: 9/10 Bad splits: 0/6
The generated trees are edited using treeio and plotted
with ggtree. Each tree was rerooted to its respective
Unmutated Common Ancestor (UCA), which here is defined simply as
the combined V and J gene germlines. The gap between V and J is inserted
automatically by the alignment method, so the CDR3 is not
considered part of the UCA.
# function to root tree in UCA
.to_root_uca <- function(x){
root(x,which(grepl("UCA",x[["tip.label"]])))
}
ls <- list.files(paste0("../",result.dir), pattern = "*\\.tre", full.names = TRUE)
names(ls) <- lapply(ls, function(x) {
if (stringr::str_count(x, "LOR") > 1) {
substr(x, 24, 34)
}else if (grepl("LC",x)) {
substr(x, 24, 31)
}else if (grepl("LOR",x)) {
substr(x, 24, 28)
}else{
substr(x, 24,33)
}})
trees <- lapply(ls, read.tree)
trees_rerooted <- lapply(trees, .to_root_uca)
plots <- lapply(trees_rerooted, ggtree)
for(i in seq_along(plots)){
print(i)
plot_name <- names(plots)[i]
plots_edit <- plots[[i]]$data %>%
mutate(new_label = ifelse(grepl("sc_|LOR",label), gsub("sc_", "",label), ""),
new_label = plyr::mapvalues(new_label,from = lor_mabs$well_ID, to = lor_mabs$LOR,
warn_missing = FALSE),
new_label = ifelse(grepl("LOR",new_label), new_label, ""),
name = label,
timepoint = case_when(grepl("B2-igg",name) ~ "B2",
grepl("B1-igg",name) ~ "B1",
grepl("sc_|LOR",name) ~ "single_cell",
grepl("igm",name) ~ "PV",
TRUE ~ "intersects"),
name = gsub("sc_", "", label))
if(stringr::str_count(plot_name, "LOR") > 1){
plots_edit <- left_join(plots_edit, sc_seq_count[[paste0(plot_name,1)]], by = "name")
}else{
plots_edit <- left_join(plots_edit, sc_seq_count[[plot_name]], by = "name")
}
# shapes_timepoints <- c("PV" = 18, "B1" = 17, "B2" = 16, "single_cell" = 4)
colors_timepoints <- c("PV" = "red", "intersects" = "black", "B1" = "#66c2a5", "B2" = "#fc8d62", "single_cell" = "#fc8d62")
gg_plot <- ggtree(plots_edit,aes(color = timepoint)) +
{if(grepl("LC", plot_name))geom_tippoint(shape = 18, size = 1)
else geom_tippoint(aes(size = duplicated), shape = 18)}+
# geom_tippoint(aes(size = duplicated), shape = 23) +
geom_tiplab(aes(label = new_label), hjust = -.2) +
labs(size = "Count", shape = "Shape", color = "Timepoint") +
scale_color_manual(values = colors_timepoints) +
coord_cartesian(clip = 'off') +
# scale_shape_manual(values = shapes_timepoints) +
scale_size_area(limits = c(1,25), breaks = c(1,5,10,15,25), max_size = 3)+
geom_treescale(width = .05)
gg_plot
ggsave(gg_plot, filename = paste0("../", result.dir, names(ls)[[i]], "_tree.pdf"), width = ifelse(grepl("LC", plot_name), 3, 4), height = ifelse(grepl("LC", plot_name), 3, 6))
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
Query the LOR24 amino acid sequence in the bulk repertoire sequencing from each animal. The query was done without alignment due to the large number of sequences; instead, string distance was calculated using the Levenshtein distance.
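As a minimal illustration of how a Levenshtein distance translates into the percent identity used here (the sequences are invented; only the stringdist package is assumed):

```r
library(stringdist)

query <- "CARGGRGNT"  # invented short query sequence
hit   <- "CARGGKGNT"  # one substitution away from the query

# Levenshtein (edit) distance between the two strings
d <- stringdist(hit, query, method = "lv")

# percent identity, normalized by the query length as done for LOR24
identity <- (1 - d / nchar(query)) * 100
```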
The entire dataset was loaded, filtered, and used to calculate
distances to LOR24. The output was saved in
.rds format to avoid recalculation on each run.
processed_data <- readRDS("../data/clonotypes/individualized/processed_clonotypes.rds")
processed_data <- processed_data %>% select(grp, name, ID, timepoint, new_name, V_SHM, VDJ_aa)
LOR24aa <- "QLQLQESGPGLVKSSETLPLTCAVSGDSISSSYWSWIRQAPGKGLEWIGYIYGSGSYSHYNPSLKSRVTLSVDTSKNQFFLRLNSVTVADTAVYYCARGGRGNTYSWNRFDVWGPGVLVTVSS"
# calculate string distance between LOR24 and all the sequences
identity_matrix <- stringdist::stringdist(gsub("\\*", "",processed_data$VDJ_aa), LOR24aa, method = "lv", nthread = 8)
# calculate the percentage of identity, normalize by LOR24 length
processed_data <- processed_data %>%
mutate(identity = (1-identity_matrix/nchar(LOR24aa)) * 100) %>%
select(V_SHM, identity, ID)
saveRDS(processed_data, "../data/clonotypes/individualized/lor24_identity_calculation.rds")
The processed dataset above was used as input for a 2D histogram of the somatic hypermutation percentage versus the percent identity to the LOR24 amino acid sequence. In essence, this plot shows how sequences across different non-human primates were evolving toward higher identity to LOR24.
processed_data <- readRDS("../data/clonotypes/individualized/lor24_identity_calculation.rds")
# plot a 2d-histogram of mutations and identity to LOR24
processed_data %>%
filter(identity >= 70) %>%
ggplot(aes(x = V_SHM, y = identity)) +
geom_bin2d(bins = 30) +
scale_fill_viridis_c(trans = "log10") +
theme_prism() +
# scale_y_continuous(expand=c(0,1),limits=c(0,101)) +
labs(x = "% Divergence from germline", y = paste0("% Identity to LOR24aa"),
fill = "Sequence counts\n(log10)") +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(size = .5),
aspect.ratio = 1,
legend.title = element_text(size = 12)) +
facet_wrap(~ID)
ggsave(filename = paste0("../", result.dir,"LOR24_2d-histogram.pdf"))
LOR mAbs showed diverse competition profiles against different standardized mAbs while binding the PostF and PreF RSV fusion proteins.
# edited the raw values so that the max value is the reference competition for ADI, MPE8, 101F, D25
data_comp_auc <- read.csv("../data/ELISA_comp/2023-03-10_LOR_norm-dAUC.csv", header = T, row.names = 1) %>%
mutate(specificity = factor(specificity, levels = c("PreF", "PreF/PostF", "PostF")),
epitope = factor(epitope, levels = c("Ø", "III", "IV", "I/IV", "II","I", "UK-Pre", "UK-DP", "PostF", "foldon", "WB")))
data_comp_auc$EC50_preF <- log10(data_comp_auc$EC50_preF)
data_comp_auc$EC50_postF <- log10(data_comp_auc$EC50_postF)
# change here the columns you want to remove from the heatmap
to_remove <- c("EC50_postF","EC50_preF", "epitope", "specificity")
annot_colors <- list(specificity = c("PreF" = "#F7D586",
"PreF/PostF" = "#CD87F8",
"PostF" = "#92CDD6"), epitope = c(
"Ø" = "#F3B084",
"III" = "#A9D08D",
"IV" = "#FAD0FF",
"I/IV" = "#cbc9f3",
"II" = "#FFD966",
"I" = "#9BC1E6",
"foldon" = "black",
"UK-Pre" = "#C9C9C9",
"UK-DP" = "#8497B0" ,
"WB" = "black",
"PostF" = "black"))
{
g_heatmap_scale <- data_comp_auc %>%
select(-all_of(c(to_remove))) %>%
mutate(across(.cols = everything(), .fns = function(x) pmax(x,0))) %>%
t() %>%
pheatmap::pheatmap(scale = "none",cutree_cols = 8, angle_col = 90, fontsize_row = 8, clustering_method = "ward.D", color = viridisLite::viridis(100), cluster_rows = FALSE, border_color = "grey40", cluster_cols = TRUE, annotation_col = data_comp_auc[,11:12], annotation_colors = annot_colors)
ggsave(g_heatmap_scale,filename = paste0("../", result.dir, "/", "LOR_heatmap_auc_wardD.pdf"), width = 24, height = 10, units = "cm")
}
This MDS is another way to visualize the same data shown in the heatmap above. Here, the MDS input is the Euclidean distance. The data were scaled and centered prior to calculating the Euclidean distances, Euclidean distance being the simplest and most widely used distance metric.
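A minimal sketch of the same scale → dist → cmdscale pipeline on an invented matrix (base R only):

```r
# Invented 4 x 3 data matrix: rows are observations, columns are variables
m <- matrix(c(1, 2, 3,
              2, 4, 1,
              9, 1, 0,
              4, 7, 2), nrow = 4, byrow = TRUE)

# center and scale each column, compute pairwise Euclidean distances,
# then project onto the first two principal coordinates (classical MDS)
coords <- cmdscale(dist(scale(m)), k = 2)
dim(coords)  # one 2-D coordinate pair per row of m
```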
# check which mAbs should be included and added
to_habillage <- factor(rownames(data_comp_auc), levels = paste0("LOR", sprintf(fmt = "%02d", 1:106)))
data_comp_auc <- data_comp_auc %>% tibble::rownames_to_column("name")
# compute MDS
mds_scaled <- data_comp_auc %>%
select(-all_of(c(to_remove, "name"))) %>%
scale(center = TRUE, scale = TRUE) %>%
dist(method = "euclidean") %>%
cmdscale() %>%
as_tibble()
colnames(mds_scaled) <- c("Dim.1", "Dim.2")
# Plot MDS
mds_scaled$name <- to_habillage
p1 <- mds_scaled %>%
left_join(data_comp_auc) %>%
ggplot(aes(Dim.1,Dim.2, label = name, color = epitope)) +
geom_point(size = 2) +
ggrepel::geom_text_repel(max.overlaps = 5) +
scale_color_manual(values = annot_colors[["epitope"]]) +
ggprism::theme_prism(base_fontface = "plain", border = TRUE, base_line_size = 1) +
theme(axis.ticks = element_line(size = .5),
aspect.ratio = 1,
legend.position = "none")
plotly::ggplotly(p1)
ggsave(width = 5, height = 3,filename = paste0("../", result.dir, "mds_euclidean-distance-plus-binding.pdf"))
renv::snapshot()
sessionInfo()
## R version 4.2.2 (2022-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
##
## Matrix products: default
## BLAS/LAPACK: /Users/rodrigoarcoverde/opt/anaconda3/envs/rsv_np_repertoire/lib/libopenblasp-r0.3.21.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] kableExtra_1.3.4 dplyr_1.1.0 ggprism_1.0.4
## [4] scales_1.2.1 vegan_2.6-4 lattice_0.20-45
## [7] permute_0.9-7 data.table_1.14.8 Biostrings_2.66.0
## [10] GenomeInfoDb_1.34.8 XVector_0.38.0 IRanges_2.32.0
## [13] S4Vectors_0.36.0 BiocGenerics_0.44.0 ggpubr_0.6.0
## [16] ggplot2_3.4.1 rstatix_0.7.2 ggtree_3.6.0
## [19] treeio_1.22.0 tidyr_1.3.0 jsonlite_1.8.4
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-162 bitops_1.0-7 sf_1.0-7
## [4] RColorBrewer_1.1-3 webshot_0.5.4 httr_1.4.5
## [7] tools_4.2.2 backports_1.4.1 bslib_0.4.2
## [10] utf8_1.2.3 R6_2.5.1 KernSmooth_2.23-20
## [13] DBI_1.1.3 lazyeval_0.2.2 mgcv_1.8-42
## [16] colorspace_2.1-0 withr_2.5.0 tidyselect_1.2.0
## [19] compiler_4.2.2 cli_3.6.0 rvest_1.0.3
## [22] xml2_1.3.3 plotly_4.10.1 labeling_0.4.2
## [25] sass_0.4.5 classInt_0.4-9 proxy_0.4-27
## [28] systemfonts_1.0.4 stringr_1.5.0 digest_0.6.31
## [31] yulab.utils_0.0.6 rmarkdown_2.20 svglite_2.1.1
## [34] pkgconfig_2.0.3 htmltools_0.5.4 fastmap_1.1.1
## [37] highr_0.10 htmlwidgets_1.6.1 rlang_1.0.6
## [40] rstudioapi_0.14 ggVennDiagram_1.2.2 gridGraphics_0.5-1
## [43] jquerylib_0.1.4 generics_0.1.3 farver_2.1.1
## [46] crosstalk_1.2.0 car_3.1-1 RCurl_1.98-1.10
## [49] magrittr_2.0.3 ggplotify_0.1.0 GenomeInfoDbData_1.2.9
## [52] patchwork_1.1.2 Matrix_1.5-3 Rcpp_1.0.10
## [55] munsell_0.5.0 fansi_1.0.4 ape_5.7
## [58] abind_1.4-5 lifecycle_1.0.3 stringi_1.7.12
## [61] yaml_2.3.7 carData_3.0-5 MASS_7.3-58.3
## [64] zlibbioc_1.44.0 plyr_1.8.8 grid_4.2.2
## [67] ggrepel_0.9.3 parallel_4.2.2 crayon_1.5.2
## [70] splines_4.2.2 knitr_1.42 pillar_1.8.1
## [73] ggsignif_0.6.4 glue_1.6.2 evaluate_0.20
## [76] ggfun_0.0.9 vctrs_0.5.2 gtable_0.3.1
## [79] purrr_1.0.1 cachem_1.0.7 xfun_0.37
## [82] broom_1.0.4 e1071_1.7-13 tidytree_0.4.2
## [85] class_7.3-21 RVenn_1.1.0 viridisLite_0.4.1
## [88] pheatmap_1.0.12 tibble_3.2.0 aplot_0.1.10
## [91] units_0.8-1 cluster_2.1.4 ellipsis_0.3.2